ABSTRACT

In this paper we give a short theoretical description of the general
predictive adaptive arithmetic coding technique. We establish the links
between this technique and the work of J. Rissanen in the 1980s, in
particular the BIC information criterion used in parametric model
selection problems. We also design lossless and lossy coding techniques
for images. The lossless technique mixes fixed-length coding and
arithmetic coding and compresses better than either method used alone.
It also has an interesting application in statistics, since it gives a
data-driven procedure for the non-parametric histogram selection
problem. The lossy technique uses only predictive adaptive arithmetic
codes and shows how a good choice of the order of prediction can
improve compression. We illustrate these coding techniques on a raw
grayscale image.

1. INTRODUCTION 

Arithmetic Coding (AC) is an efficient binary coding technique. We use
it here in one of its most general forms: the predictive and adaptive
one. Even though both aspects of AC are known, it is quite hard to find
literature dealing with the two together, or to determine which of them
are actually used in image coding standards such as JPEG and JPEG2000.
We try here to address the first issue but could not collect useful
information about the second. This paper does not seek compression
efficiency; it aims to show how different AC processes may be used in
both parametric (§3) and non-parametric (§4) model selection problems.
This explains why we choose to work on raw images.

After a description of the AC algorithm in §2 we take a closer look at
the resulting codelength. To this end, we use the work of J. Rissanen
in [6] [7] and especially [8]. The main conclusion of §3 is that the
codelength belongs to the family of information criteria, a widely used
tool in the vast problem of model selection. We aim to show that the
adaptive aspect of the AC used here is an essential feature.

Next, we design in §4 a new lossless coding technique. It uses a mix
between AC, which is compression efficient, and fixed-length coding,
which is not. It is shown in §4.2 that correctly mixing those two
methods gives better compression efficiency than using only AC. The
most important parameter to be adjusted in order to get that "correct"
mix is the order of prediction. Moreover, that method is shown in §4.3
to have a direct application in the histogram selection problem.

Finally we design in §5 a lossy coding technique which, once again,
shows the importance of the order of prediction.



2. GENERALITIES ON ARITHMETIC CODING 

2.1 Multiple Markov Chain 

The notion of Multiple Markov Chain (MMC) leads to arithmetic coding.
Let $E = \{a_1, \dots, a_m\}$ be a finite set with m elements. An
E-valued process $(X_n)_{n \in \mathbb{N}^*}$ is an order k MMC if
$k \in \mathbb{N}$ is the smallest integer satisfying the equality of
laws $P(X_n | X_{n-1}, \dots, X_1) = P(X_n | X_{n-1}, \dots, X_{n-k})$
for all n. We will always work in the case where that law does not
depend on n; the chain is then said to be homogeneous. An order 0 MMC
is a sequence of independent random variables.

If X is an order k MMC, we will suppose that $X_1, \dots, X_k$ are
independent and uniformly distributed on E. For $i \in E$ a state and
$j \in E^k$ a multiple state, we denote by $\theta(i|j)$ the
probability to see i after j. Consequently, choosing $(m-1)m^k$ real
numbers $\theta(i|j)$ for $j \in E^k$ and
$i \in \{a_1, \dots, a_{m-1}\}$ is enough to describe the evolution of
X. Let $\theta$ denote such a parameter and $x^n = x_1, \dots, x_n$ be
a sequence of elements of E; the likelihood of $x^n$ relative to
$\theta$ is:

$$P(x^n|\theta) = m^{-k} \prod_{j \in E^k} \prod_{i \in E} \theta(i|j)^{n(i|j)} \qquad (1)$$

where $n(i|j)$ is the number of occurrences of i after j in $x^n$.
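
As a minimal sketch of the quantities just defined, the following Python
code counts the occurrences n(i|j) and evaluates the log-likelihood of a
sequence. It assumes that the likelihood is the product of the uniform
probability m^(-k) of the first k symbols and of the factors
theta(i|j)^n(i|j), and that logarithms are taken in base 2. The function
names and the dictionary layout of theta are illustrative choices, not
part of the original text.

```python
from collections import Counter
from math import log2


def transition_counts(x, k):
    """Map (j, i) -> n(i|j), where j is a tuple of k consecutive symbols."""
    counts = Counter()
    for t in range(k, len(x)):
        counts[(tuple(x[t - k:t]), x[t])] += 1
    return counts


def log_likelihood(x, k, theta, m):
    """log2 of the likelihood of x; theta maps (j, i) -> probability."""
    ll = -k * log2(m)  # first k symbols: independent and uniform on E
    for (j, i), n_ij in transition_counts(x, k).items():
        ll += n_ij * log2(theta[(j, i)])
    return ll


def ml_estimator(x, k):
    """Maximum likelihood estimate: n(i|j) divided by the count of j."""
    counts = transition_counts(x, k)
    n_j = Counter()
    for (j, _), c in counts.items():
        n_j[j] += c
    return {(j, i): c / n_j[j] for (j, i), c in counts.items()}
```

For instance, `ml_estimator("abaa", 1)` estimates the probability of b
after a as 1/2, since a is followed once by b and once by a.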

2.2 Predictive adaptive arithmetic coding : PAAC 

We deal here with a general AC which is both k-predictive and adaptive;
we shorten it to k-PAAC. Predictive means we code using orders k that
may be greater than 1, hence a prediction of the future state of the
chain from the current state. Adaptive means we do not need any prior
knowledge of the chain, except its order; we learn how to predict the
future step by step. Both notions have been formally introduced and
studied by Rissanen [6] [7] [8]. For a more concrete description of
arithmetic coding, we refer to [11]; note that this reference does not
mention the predictive aspect. Let us now give a theoretical
description of the general k-PAAC algorithm.

Let $x^n = x_1, \dots, x_n$ be a chain of elements of E to be encoded
and $I_c$ be the current interval, initially set to $I_c = [0, 1)$. For
$n \geq t \geq 1$ we write $x^t = x_1, \dots, x_t$. The only prior we
need is an order of coding $k \geq 0$; the algorithm then works as
follows.

Suppose that the first $t \geq 0$ symbols are dealt with; t = 0 means
we have not started coding yet. To deal with the (t + 1)-th symbol we
update the transition probabilities as follows:

$$\theta^{(t)}(i|j) = \frac{n^{(t)}(i|j) + 1}{n^{(t)}(j) + m}$$

where $i \in E$, $j \in E^k$, and $n^{(t)}(i|j)$ and $n^{(t)}(j)$
denote the respective numbers of occurrences of i after j and of j in
the chain $x^t$; $n^{(t)}(j)$ must not count an occurrence of j at the
very end of that chain. If k = 0, the multiple states j vanish and we
set $n^{(t)}(j) = t$. Those probabilities reflect what we know of the
chain at time t of the coding process; they are the adaptive aspect. We
then set $j = x_{t-k+1}, \dots, x_t$, the current state, and split the
current interval $I_c$ into m smaller intervals according to the
probabilities $\theta^{(t)}(i|j)$, $i \in E$. This way, we associate to
each possible future state $i \in E$ an interval whose length is
proportional to the probability with which we expect it. The (t + 1)-th
symbol is dealt with by choosing as the new $I_c$ the interval
corresponding to $i = x_{t+1}$.

Once the last symbol $x_n$ has been dealt with, we are left with an
interval $I_c = [low, high)$. Let $\lceil \cdot \rceil$ denote the
ceiling function; there exist two consecutive dyadic numbers with
length $\lceil -\log(high - low) \rceil$ in $I_c$. We take as the
arithmetic code of $x^n$ the sequence of bits given by the fractional
part of the biggest one. If encoder and decoder agree on the order k of
coding, that sequence of bits is decodable; we refer again to [11].

For illustration, in table 1 we take m = 2, E = {a, b} and encode
$x^4 = abaa$ at order k = 1. In the splits, we assign the left interval
to a.



Tab. 1 - Order 1 PAAC of the chain abaa. 
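
The interval computations of this example can be reproduced with the
following Python sketch, which uses exact rational arithmetic. It is an
illustration under stated assumptions rather than a complete coder: the
first k symbols are coded as uniform on E, sub-intervals are laid out in
alphabet order (a on the left), and only the final interval and the
ideal codelength are returned, not the bit sequence. The names
`paac_interval` and `code_length` are hypothetical.

```python
from fractions import Fraction
from math import ceil, log2


def paac_interval(x, alphabet, k):
    """Final interval [low, high) of the order-k PAAC of the sequence x."""
    m = len(alphabet)
    low, high = Fraction(0), Fraction(1)
    n_ij = {}  # (state j, symbol i) -> occurrences of i after j so far
    n_j = {}   # state j -> occurrences of j already followed by a symbol
    for t, symbol in enumerate(x):
        if t < k:
            # Assumption: the first k symbols are coded as uniform on E.
            state = None
            probs = [Fraction(1, m)] * m
        else:
            state = tuple(x[t - k:t])
            probs = [Fraction(n_ij.get((state, i), 0) + 1,
                              n_j.get(state, 0) + m) for i in alphabet]
        # Split the current interval; sub-intervals follow alphabet order.
        width, cumulative = high - low, Fraction(0)
        for i, p in zip(alphabet, probs):
            if i == symbol:
                low, high = (low + width * cumulative,
                             low + width * (cumulative + p))
                break
            cumulative += p
        # Adaptive step: update the counts with the symbol just coded.
        if state is not None:
            n_ij[(state, symbol)] = n_ij.get((state, symbol), 0) + 1
            n_j[state] = n_j.get(state, 0) + 1
    return low, high


def code_length(low, high):
    """Ideal codelength in bits: ceiling of -log2(high - low)."""
    return ceil(-log2(float(high - low)))


low, high = paac_interval("abaa", "ab", 1)
print(low, high, code_length(low, high))  # 1/4 7/24 5
```

With the chain abab instead, the final interval has length 1/12 and the
ideal codelength drops to 4 bits.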
This example shows the following general fact about k-PAAC: the more
unexpected behaviours occur in the chain, the smaller the last $I_c$
and the longer the code. For instance, at step t = 4 we expected b with
probability 2/3 and observed a. This caused us to choose the small
interval $I_c = [1/4, 7/24)$. For comparison, if b had occurred the
code would have been 0110, which is 1 bit shorter. This leads us to the
notion of information criteria (IC).

3. INFORMATION CRITERIA 

Let us show how the PAAC may be used to solve the following model
selection problem: if $x^n$ is a realisation of an unknown MMC (§2.1),
what is its order? More precisely, we will see how the adaptive aspect
of the PAAC is involved.

3.1 Coding approach of the model selection problem 

As mentioned earlier, the k-PAAC length of $x^n$, say $L(x^n|k)$, is
ruled by the unexpected events in $x^n$: the more unexpected events,
the longer the code. Consequently, if $x^n$ is ruled by an unknown
order $k^*$ MMC and we try to k-PAAC it at an order $k \neq k^*$, many
unexpected events might occur: either because $k < k^*$ and we do not
look far enough into the past, or because $k > k^*$ and we take into
account information about a too distant past which actually has no
influence on the future. Thus the minimization of $L(x^n|k)$ is an
appropriate tool for seeking $k^*$.

The work of Rissanen confirms that idea and establishes a link with
Information Criteria (IC).

3.2 Rissanen's result 

In [8] it is shown that $L(x^n|k)$ asymptotically behaves as:

$$BIC(x^n|k) = -\log P(x^n|\hat\theta_k) + \frac{(m-1)m^k}{2} \log n \qquad (2)$$

where $\hat\theta_k$ is the maximum likelihood (ML) estimator of order
k for $x^n$, i.e. the parameter that maximizes (1).

BIC stands for Bayesian Information Criterion and belongs to the
formalism of IC first introduced by Akaike [1]; let us mention [10] [5]
[3] in addition to [1] [8] as important steps in the theory of IC.

Here is the idea behind IC: the first term of the criterion (2),
referred to as the ML term, decreases as k grows. This is mainly
because the ML estimator fits the data more accurately if we let it
look far back into the past. This phenomenon is known as
overparametrization and is the major problem to be solved in model
selection; it appears in figure 1. On the other hand, the second term,
the penalty, increases as k grows due to $(m-1)m^k$, which is the
number of free parameters in the MMC model of order k. Therefore, the
minimization of IC over k strikes a balance between the data fitting,
measured by the ML term, and the complexity of the model needed to
obtain such a fitting, measured by the penalty.

The quantity $BIC(x^n|k)$ is much faster to compute than $L(x^n|k)$;
the encoder should use BIC before encoding to find which order will
achieve the minimum codelength.
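
A possible way to carry out this order selection is sketched below in
Python. It assumes logarithms in base 2 (so that the criterion is
expressed in bits) and includes the k log2(m) bits of the first k
uniformly distributed symbols in the ML term; `bic` and `best_order`
are illustrative names.

```python
from collections import Counter
from math import log2


def bic(x, k, m):
    """BIC of the sequence x at order k, in bits, for an alphabet of size m."""
    n = len(x)
    n_ij, n_j = Counter(), Counter()
    for t in range(k, n):
        j = tuple(x[t - k:t])
        n_ij[(j, x[t])] += 1
        n_j[j] += 1
    # ML term: -log2 of the likelihood at the ML estimate n(i|j) / n(j).
    ml = k * log2(m)
    for (j, i), c in n_ij.items():
        ml -= c * log2(c / n_j[j])
    # Penalty: half the number of free parameters times log2(n).
    penalty = (m - 1) * m ** k / 2 * log2(n)
    return ml + penalty


def best_order(x, m, max_k):
    """Order in 0..max_k minimizing the BIC."""
    return min(range(max_k + 1), key=lambda k: bic(x, k, m))
```

On the strictly alternating chain abab...ab, the ML term vanishes from
order 1 onwards while the penalty keeps growing with k, so `best_order`
selects order 1.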

One can design a non-adaptive order-k predictive arithmetic coding
process whose codelength would be exactly
$\lceil -\log P(x^n|\hat\theta_k) \rceil = \lceil ML \rceil$. However,
this process requires sending the parameter $\hat\theta_k$ for
decodability and, above all, it no longer answers the problem of order
selection since ML suffers from the overparametrization issue. In terms
of IC, the adaptive aspect of the process creates the penalty term
which avoids overparametrization; see again figure 1.

3.3 Comparison of actual codings with criterion 

We generate a realization $x^n$ of an order $k^* = 5$ MMC with m = 2
and n = 25000. For k = 0, ..., 10 we encode it with the k-PAAC process.
We also compute the criterion $BIC(x^n|k)$ and the quantity
$ML = -\log P(x^n|\hat\theta_k)$. Results are presented in figure 1,
divided by n to express them as a bit-rate.

As expected, the BIC and k-PAAC curves present a minimum at $k = k^*$
while the ML method overparametrizes at k = 9.

Note that, when computing BIC, it is desirable to have enough
observations compared to the number of free parameters. Empirically,

$$n \approx \alpha (m-1) m^k \quad \text{with } \alpha \geq 20 \qquad (3)$$

would be good. If n is too small relative to the number of transition
probabilities to be estimated, those transitions do not occur often in
the chain and their estimation is weak, so that the penalty dominates
the ML term. An alternative would be to count the transitions actually
observed in the chain and plug that number into (2) instead of
$(m-1)m^k$.

Fig. 1 - Superposition of codelengths and criteria.
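
Both quantities mentioned here are cheap to evaluate. The Python sketch
below computes the ratio alpha of rule (3) and the number of distinct
transitions observed in a chain; the function names are illustrative,
and the way the observed count would replace the number of free
parameters in the penalty is left open, as in the text.

```python
def alpha(n, m, k):
    """Observations per free parameter: n / ((m - 1) * m**k)."""
    return n / ((m - 1) * m ** k)


def observed_transitions(x, k):
    """Number of distinct (state, next symbol) pairs occurring in x."""
    return len({tuple(x[t - k:t + 1]) for t in range(k, len(x))})


# Example: a 512 x 512 image quantized on m = 30 levels, coded at order 2.
print(round(alpha(512 * 512, 30, 2), 2))  # 10.04
```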



4. LOSSLESS CODING OF RAW IMAGES 

Let $[\![p, q]\!]$ be the set of integers from p to q. Let us choose an
r x c greyscale image and set n = rc. First, the image has to be turned
into a vector $x^n \in [\![0, 255]\!]^n$. For order $k \geq 1$ codings,
the way this linearization is done matters, since one does not want to
lose proximity information between pixels. We have chosen the "zigzag"
linearization used for the 8 x 8 blocks of the DCT transform in the
JPEG standard [12]. Other transformations have been tested and the
results are quite similar. Let us now describe our lossless coding
method.

4.1 Lossless coding method 

It is a two-part coding technique. First, choose a partition P of
$I = [\![0, 255]\!]$, that is, a set of m disjoint intervals
$(I_j)_{j \in [\![1, m]\!]}$ whose union is I. Then, from $x^n$, form a
new chain $y^n$ as follows:

$$\forall i \in [\![1, n]\!], \quad y_i = \sum_{j=1}^{m} j \, \mathbb{1}_{I_j}(x_i). \qquad (4)$$

That is, each $y_i$ denotes the number of the interval of P in which
$x_i$ falls. The chain $y^n$ has values in $E = [\![1, m]\!]$. For k an
order, we denote by $L(y^n|k, P)$ its k-PAAC codelength. If m = 1, we
set $L(y^n|k, P)$ to 0.

Secondly, we denote by $A_j$ the number of integers in $I_j$ and by
$n_j$ the number of $x_i$'s falling in $I_j$. Once $y_i = j$ is known
one needs, in order to recover $x_i \in I_j$, to specify which one of
those integers $x_i$ actually is. This is done for each $x_i \in I_j$
by a simple code with fixed length $\lceil \log A_j \rceil$. Therefore,
the number of bits required to recover $x^n$ from $y^n$ is
$L(x^n|y^n) = \sum_{j=1}^{m} n_j \lceil \log A_j \rceil$.

For decodability, one should also send the partition chosen to encode.
We do not take this into account here since the codelength required to
this end is very small compared to the quantities $L(y^n|k, P)$ and
$L(x^n|y^n)$ we work on.

Let us write $L(x^n|k, P) := L(y^n|k, P) + L(x^n|y^n)$ for the total
lossless codelength of $x^n$ with the help of the partition P.

Fig. 2 - Lossless estimated bit-rates of Lena at order 0, 1, 2.
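
The second part of this two-part code is easy to sketch in Python. The
code below maps each grey level to the number of its interval and sums
the fixed-length refinement bits. The partition is represented by the
sorted list of the first grey level of each interval, which is an
illustrative convention; the logarithm is assumed to be in base 2, and
the ceiling of log2(A) is computed exactly as `(A - 1).bit_length()`.

```python
from bisect import bisect_right
from collections import Counter


def interval_numbers(x, starts):
    """Chain y: number in 1..m of the interval containing each grey level.

    starts is the sorted list of the first grey level of each interval,
    with starts[0] == 0.
    """
    return [bisect_right(starts, v) for v in x]


def refinement_length(y, starts, size=256):
    """Bits needed to recover x from y with fixed-length codes per interval."""
    ends = starts[1:] + [size]
    widths = [e - s for s, e in zip(starts, ends)]  # A_j for each interval
    n_j = Counter(y)
    # (A - 1).bit_length() equals ceil(log2(A)) for every integer A >= 1.
    return sum(n_j[j] * (widths[j - 1] - 1).bit_length() for j in n_j)


x = [0, 63, 64, 200, 255]
starts = [0, 64, 128, 192]            # four intervals of 64 grey levels
y = interval_numbers(x, starts)
print(y, refinement_length(y, starts))  # [1, 1, 2, 4, 4] 30
```

With the single interval `starts = [0]`, every pixel costs 8 refinement
bits, which is plain fixed-length coding.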

4.2 Choice of partition and order of prediction 

As m grows, $L(y^n|k, P)$ also grows because $y^n$ has values in
$[\![1, m]\!]$. By contrast, $L(x^n|y^n)$ decreases since the intervals
$I_j$ get smaller. Consequently, there should exist a partition P which
balances those two phenomena by minimizing the codelength
$L(x^n|k, P)$. This argument belongs to the theory of Minimum
Description Length (MDL) introduced by Rissanen, for which we refer to
Grünwald et al. [4].

We estimate $L(y^n|k, P)$ by $BIC(y^n|k)$, see (2). We then define the
following criterion as an estimate of the lossless order k codelength
of $x^n$ with the partition P:

$$CRIT(x^n|k, P) = BIC(y^n|k) + L(x^n|y^n). \qquad (5)$$

We restrict ourselves to regular partitions, i.e. partitions P(m) whose
intervals all have length 256/m. We work with the 512 x 512 greyscale
Lena image.

Figure 2 presents, for m ranging from 1 to 256, the estimated bit-rate
$CRIT(x^n|k, P(m))/n$ for k = 0, 1, 2. For k = 1, condition (3) is
satisfied for m up to 115, but we still give the k = 1 curve up to
m = 256 for completeness. The algorithm complexity increases
considerably with the order k, and computations for k > 2 show no
significant improvement; in the case k = 2 we went up to m = 30, which
makes $\alpha$ about 10.

Note that our coding technique with P(1) is equivalent to the pgm
format¹. In the other extreme case, with P(256), we get $y^n = x^n$ and
$L(x^n|y^n) = 0$; this means we directly encode the chain $x^n$ with
the k-PAAC process. Considering this, figure 2 shows how a mix of those
two methods leads to better bit-rates. The minimization of the
criterion (5) tells us which partition is to be chosen in order to get
the correct mix.
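
The search for that partition can be sketched as follows in Python. The
code is an illustration under assumptions the text does not settle:
when m does not divide 256, interval j of P(m) is taken to hold the
grey levels v with v * m // 256 == j - 1, logarithms are in base 2, and
the first k symbols contribute k log2(m) bits to the ML term. The
function names are hypothetical.

```python
from collections import Counter
from math import log2


def bic(y, k, m):
    """BIC of the chain y at order k, in bits, for an alphabet of size m."""
    n = len(y)
    n_ij, n_j = Counter(), Counter()
    for t in range(k, n):
        j = tuple(y[t - k:t])
        n_ij[(j, y[t])] += 1
        n_j[j] += 1
    ml = k * log2(m)
    for (j, i), c in n_ij.items():
        ml -= c * log2(c / n_j[j])
    return ml + (m - 1) * m ** k / 2 * log2(n)


def crit(x, k, m, size=256):
    """Estimated lossless codelength of x with the regular partition P(m)."""
    y = [v * m // size + 1 for v in x]
    # Number of grey levels in each interval of P(m).
    widths = Counter(v * m // size + 1 for v in range(size))
    # Fixed-length refinement: ceil(log2(A_j)) bits per pixel in interval j.
    fixed = sum((widths[j] - 1).bit_length() for j in y)
    return bic(y, k, m) + fixed


def best_partition(x, k, max_m=256):
    """Number of intervals m minimizing the criterion at order k."""
    return min(range(1, max_m + 1), key=lambda m: crit(x, k, m))
```

With m = 1 the criterion reduces to 8 bits per pixel, and with m = 256
the refinement term vanishes and only the BIC of the original chain
remains.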

More importantly, 1-PAAC is clearly seen to reach better bit-rates than
0-PAAC: roughly 7 bpp with the huge P(200) partition for 0-PAAC against
5.4 bpp with P(50) for 1-PAAC. Note that the order k chosen for the
coding process only affects the first term $BIC(y^n|k)$ of the
criterion (5), hence we may also give the following interpretation of
the curves in figure 2: no matter how we quantize them via a partition,
the grey levels in our image should not be considered independent but
rather of order 1. Unsurprisingly, that dependence of a pixel's grey
level on its neighbours can be shown in this way on most common images
whose content is comprehensible to the human brain.

¹ http://www.imagemagick.org/script/formats.php

Fig. 3 - Laplace distribution and histogram chosen by (6).

4.3 Histogram selection statistical problem 

It is interesting to note that the criterion (5) may be directly
extended to the histogram selection statistical problem: if f is an
unknown density on an interval I and $x^n$ is a sample from this
density, which partition of I should be chosen to build a histogram
estimator of f?

For such a partition P, by independence of $x^n$ and formula (4), it is
readily seen that the $y_i$'s are independent, so that the 0-PAAC of
$y^n$ will be the best. Let us denote by $L_j$ the length of $I_j$ and
suppose that each $I_j$ contains a number of real numbers proportional
to $L_j$. Then, up to terms which do not depend on P and after a short
calculation, the estimated lossless order 0 codelength of $x^n$ using P
is:

$$CRIT(x^n|0, P) = BIC(y^n|0) + L(x^n|y^n),$$

$$CRIT(x^n|0, P) = -\sum_{j=1}^{m} n_j \log \frac{n_j}{n L_j} + \frac{m-1}{2} \log n. \qquad (6)$$

This criterion is very similar in form to the one used by Birgé et al.
in [2], except that it has a coding background which justifies its use.
Moreover, it is not restricted to regular partitions of I. If I is
supposed to contain R real numbers, there could be $2^{R-1}$ partitions
to be tested, which is huge. Rissanen et al. presented in [9] a dynamic
programming method which reduces to $O(R^2)$ the number of computations
required to find which one of the $2^{R-1}$ partitions achieves the
minimum of (6). For illustration, we present in figure 3 the partition
chosen on a sample of size 2000 from the Laplace distribution used to
represent DCT coefficients in the JPEG standard. We assume that
I = [-5, 5] and R = 200.
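
As a simplified illustration, the Python sketch below evaluates the
criterion, a fit term minus the sum of n_j log(n_j / (n L_j)) plus a
penalty (m - 1)/2 log n, for a partition given by its sorted edges, and
searches regular partitions only. It does not implement the dynamic
programming method of [9]. Logarithms are assumed to be in base 2, all
sample points are assumed to lie in the interval, and the function names
are hypothetical.

```python
from bisect import bisect_right
from math import log2


def histogram_criterion(sample, edges):
    """Criterion value in bits for the partition with the given sorted edges."""
    n, m = len(sample), len(edges) - 1
    counts = [0] * m
    for v in sample:
        # Interval j is [edges[j], edges[j + 1]); the right end point of
        # the whole range is assigned to the last interval.
        j = min(bisect_right(edges, v) - 1, m - 1)
        counts[j] += 1
    fit = -sum(c * log2(c / (n * (edges[j + 1] - edges[j])))
               for j, c in enumerate(counts) if c > 0)
    return fit + (m - 1) / 2 * log2(n)


def best_regular_partition(sample, low, high, max_bins):
    """Number of equal-width bins on [low, high] minimizing the criterion."""
    def edges(m):
        return [low + (high - low) * j / m for j in range(m + 1)]
    return min(range(1, max_bins + 1),
               key=lambda m: histogram_criterion(sample, edges(m)))
```

Empty intervals contribute nothing to the fit term here, following the
usual convention that 0 log 0 = 0.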

5. LOSSY CODING OF RAW IMAGES 

We keep the same linearization as in §4 to turn an image into a vector
$x^n$ and now describe our lossy coding method.

Fig. 4 - Estimated Lena's bit-rates for 0-PAAC and 1-PAAC.



5.1 Lossy coding method 

For P a partition of $[\![0, 255]\!]$ into m intervals, we define the
$[\![1, m]\!]$-valued chain $y^n$ as in (4). Next, we quantize the data
$x^n$ on P at their barycenters. That is, for each
$j \in [\![1, m]\!]$, we consider all the $x_i$'s falling into $I_j$,
compute their barycenter, round it to the closest integer $B_j$ and
finally set all those $x_i$'s to $B_j$. This gives a new image with
only m grey levels; this is where the loss occurs. Moreover, that
quantization creates an injective map:

$$B : [\![1, m]\!] \to [\![0, 255]\!]$$

With the help of that map, the decoder is able to reconstruct the
quantized image from the chain $y^n$ alone; therefore B must be sent.
However, the code for such a map is very short compared to the
codelength of the chain $y^n$, so we neglect it.
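
The quantization step can be sketched in a few lines of Python. The
sketch assumes a regular partition in which interval j holds the grey
levels v with v * m // 256 == j - 1, and it only defines B on the
intervals that actually contain pixels; both choices, like the function
name, are illustrative.

```python
def quantize_at_barycenters(x, m, size=256):
    """Return the chain y, the map B (as a dict) and the quantized pixels."""
    y = [v * m // size + 1 for v in x]
    sums, counts = {}, {}
    for v, j in zip(x, y):
        sums[j] = sums.get(j, 0) + v
        counts[j] = counts.get(j, 0) + 1
    # Barycenter of each non-empty interval, rounded to the closest integer.
    # Note: Python's round() sends exact halves to the nearest even integer.
    b = {j: round(sums[j] / counts[j]) for j in sums}
    return y, b, [b[j] for j in y]


y, b, quantized = quantize_at_barycenters([0, 10, 200, 250], 2)
print(y, b, quantized)  # [1, 1, 2, 2] {1: 5, 2: 225} [5, 5, 225, 225]
```

The decoder side is the last list comprehension: given y and B, the
quantized image is recovered by a simple lookup.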

Now we are left with encoding $y^n$ with the k-PAAC process, hence the
estimation of the lossy codelength of our image by the BIC criterion
(2):

$$BIC(y^n|k) = -\log P(y^n|\hat\theta_k) + \frac{(m-1)m^k}{2} \log n.$$

5.2 Influence of the order on bit-rates 

We still restrict ourselves to regular partitions P(m) and work with
Lena. Figure 4 presents the estimated bit-rates $BIC(y^n|k)/n$ for m
ranging from 1 to 256 and orders k = 0, 1. For any m, the fact that the
k = 1 curve lies below the k = 0 curve means, as in §4 and via the IC
interpretation, that the chain $y^n$ is of order 1 rather than order 0.

5.3 Comparison involving distortion 

Each value of m brings a certain quantization, thus a certain
distortion. We measure this distortion by the Peak Signal to Noise
Ratio (PSNR) and plot it against the corresponding bit-rate of 0-PAAC
and 1-PAAC in figure 5. For illustration, we present in figure 6 the
two quantized Lena images obtained for m = 3 and m = 13 with their
respective PSNR. We also give the bit-rates achieved by 0-PAAC and
1-PAAC on each of those images. For instance, this shows that at an
imposed rate of about 1.4 bpp, 1-PAAC encodes Lena with a PSNR of
33.15 dB while 0-PAAC only gives 22.11 dB.
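
For reference, the PSNR between an original and a quantized image can be
computed as in the Python sketch below. It uses the usual definition,
10 log10(peak^2 / MSE) with peak = 255 for 8-bit grey levels; the text
does not spell out the formula, so this definition is an assumption, as
is the function name.

```python
from math import inf, log10


def psnr(original, quantized, peak=255):
    """Peak signal to noise ratio in dB between two pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, quantized)) / len(original)
    # Identical images have no distortion: the PSNR is infinite.
    return inf if mse == 0 else 10 * log10(peak ** 2 / mse)
```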
Fig. 5 - Estimated Lena's bit-rates/PSNR for 0-PAAC and 1-PAAC.




m = 3 (PSNR 22.11 dB): 0-PAAC: 1.36 bpp, 1-PAAC: 0.43 bpp
m = 13 (PSNR 33.15 dB): 0-PAAC: 3.18 bpp, 1-PAAC: 1.39 bpp

Fig. 6 - Estimated PSNR and bit-rates on Lena quantized at m = 3 and
m = 13 levels for 0-PAAC and 1-PAAC.



6. PERSPECTIVES 

As mentioned in the introduction, we intentionally worked on raw images
and therefore did not obtain efficient compression results. It would be
interesting to insert the binary coding methods discussed here after,
for instance, the wavelet transform block of the JPEG2000 standard. In
order to compress, one should first determine with the BIC criterion
(2) the order of the sequence of wavelet coefficients and then use the
criterion (5) to determine the partition which allows those
coefficients to be encoded efficiently.